Abstract

Background: Supervised machine learning approaches have recently been adopted to infer
transcriptional targets from high-throughput transcriptomic and proteomic data, showing major improvements
with respect to state-of-the-art reverse engineering methods for gene regulatory networks. Unlike traditional
unsupervised techniques, a supervised classifier learns, from known examples, a function that is able to recognize
new relationships in new data. In the context of gene regulatory inference, a supervised classifier is forced to learn
from positive and unlabeled examples, as negative examples are unavailable or hard to collect. This
condition can limit the performance of the classifier, especially when the number of training examples is low.

Results: In this paper we improve the supervised identification of transcriptional targets by selecting reliable
negative examples from the unlabeled set. We introduce a heuristic based on the known topology of
transcriptional networks that restores the conventional positive/negative training condition and yields a
significant improvement in classification performance. We empirically evaluate the proposed heuristic on the
experimental datasets of Escherichia coli and show an example of application in the prediction of BCL6 direct core
targets in normal germinal center human B cells, obtaining a precision of 60%.

Conclusions: The availability of only positive examples in learning transcriptional relationships negatively affects 
the performance of supervised classifiers. We show that the selection of reliable negative examples, a practice 
adopted in text mining approaches, improves the performance of such classifiers opening new perspectives in the 
identification of new transcriptional targets. 



Background 

An important challenge of computational biology is the 
reconstruction of large biological networks from high 
throughput genomic and proteomic data. Biological net- 
works are used to represent and model molecular interac- 
tions between biological entities, such as genes and 
proteins in a given biological context. 

In this paper we focus on the identification of new
transcriptional targets, i.e. coding DNA regions directly
regulated by transcription factors. Transcription factors are
proteins, coded by specific genes, that, alone or with other
proteins in a complex, bind the targets' cis-regulatory
regions and control the target transcriptional activity
by promoting or blocking the recruitment of RNA
polymerase.

In identifying the interactions between transcription- 
factors and genes from experimental data, two broad 
classes of computational methods can be distinguished 
in literature [1,2]: those that rely on the physical interac- 
tion between molecules (gene-to-sequence interaction) 
which relate transcription factors to sequence motifs 
found in promoter regions; and algorithms based on the 
influence interaction that try to relate the expression of 
a gene to the expression of the other genes in the cell 
(gene-to-gene interaction). Most of the approaches of
the second class are basically unsupervised and model
the reconstruction of transcriptional relationships as a
classification problem, where the basic decision is the
presence or absence of a relationship between a given pair of
genes [3-6]. These methods can be divided into:
i) gene relevance network models, which detect gene-gene 
interactions with a similarity measure and a threshold, 
such as ARACNE [7], TimeDelay-ARACNE [8], and CLR 
[9] that infer the network structure with a statistical score 
derived from the mutual information and a set of pruning 
heuristics; ii) Boolean network models, which adopt a
binary variable to represent the state of a gene activity and a
directed graph, where edges are represented by Boolean
functions (e.g. REVEAL [10]); iii) differential and difference
equation models, which describe gene expression changes
as a function of the expression level of other genes with a
set of ordinary differential equations (ODE) [11]; and iv)
Bayesian models, or more generally graphical models,
which adopt Bayes' rule and consider gene expressions as
random variables [12].

The experimental validation of predicted transcriptional
regulations is performed with ChIP-on-chip [13], a technique
used to investigate interactions between proteins and
DNA in vivo by combining chromatin immunoprecipitation
(ChIP) with microarray technology (chip). Specifically,
it allows the identification of the cistrome, i.e. the set of
binding sites, for DNA-binding proteins on a genome-wide basis.
Whole-genome analysis can be performed to determine
the locations of binding sites for almost any protein of
interest, in particular transcription factors. The goal of
ChIP-on-chip is to localize protein binding sites that may
help identify functional elements in the genome. For
example, when the protein of interest is a transcription
factor, one can determine its binding sites throughout
the genome.

A recent trend in computational biology aims to reconstruct
large biological networks with supervised approaches
[5,6,14]. Supervised methods require a training set, which
in our context means a set of transcriptional targets that
are known in advance to be regulated by a transcription
factor. Training targets are used to
estimate a function that is able to discriminate whether a
new transcriptional interaction exists. The machine learning
literature has proposed several supervised algorithms:
neural networks, decision trees, logistic models, and Support
Vector Machines (SVM) [15]. Among these, SVMs gave
promising results in the reconstruction of biological networks
[16-18]. For example, SIRENE adopted an SVM
classifier to reconstruct the regulatory network of Escherichia
coli, and obtained more accurate results than unsupervised
methods based on mutual information (e.g. CLR and
ARACNE) [16]. Compared to unsupervised methods,
supervised methods are potentially more accurate, but
they need an initial set of known regulatory connections.
This is in principle not a restriction, as many regulations
are progressively discovered and shared among
researchers through public regulatory databases. Some
examples are: RegulonDB (http://regulondb.ccg.unam.mx),
KEGG (http://www.genome.jp/kegg/), TRRD (http://www.mgs.bionet.nsc.ru/mgs/gnw),
Transfac (http://www.gene-regulation.com), IPA (http://www.ingenuity.com).

In general, a supervised binary classifier needs both
positive and negative examples to learn effectively. In the
context of gene regulatory networks this condition is not
satisfied, as negative examples are not available
or may be hard to collect. In functional genomics,
information about negative examples is usually not available,
as the input is typically a finite list of genes known to have
a given function or to be associated with a given disease,
and the goal is to identify new genes sharing the same
property. Thus, from a machine learning perspective,
the supervised inference of new transcriptional targets
falls into a class of semi-supervised learning problems
that consists of learning from positive and unlabeled
data. The training set is composed only of known positive
examples (positive set) and the goal is to predict the
unknown positive examples in the unlabeled set.

Learning from only positive and unlabeled data is a hot 
topic in the literature of data mining, where three main 
families of approaches can be distinguished [19]: i) meth- 
ods that reduce the problem to the traditional two-class 
learning by selecting reliable negative examples from the 
unlabeled set [20-25]; ii) methods that do not need labeled 
negative examples and basically adjust the probability of 
being positive estimated by a traditional classifier trained 
with positive and unlabeled examples [14,26]; and iii) 
methods that treat the unlabeled set as noisy negative 
examples [27]. 

In this paper we focus on the first class of approaches 
that rely on a starting selection of negative examples. The 
main problem is that some of the selected negative exam- 
ples could in fact be positives embedded in the unlabeled 
data, reducing the prediction capability of a binary classifier. 
We empirically evaluate this effect by simulating the positive
contamination inside the negative training set, showing
that the performance of the classifier improves when the
positive contamination is low. Such a result calls for
an approach that is able to generate a sufficiently large
negative training set without positive contamination.

We propose NOIT (NOn Indirect Targets), a method to
select reliable negative training examples by exploiting the
known gene regulatory network topology in the specific
context of predicting new transcriptional targets. The
method extends, to a specific context, approaches
recently published in [28] and [29], where the selection of
reliable negatives benefits from the over-representation of
typical network motifs in currently known gene regulatory
networks [30]. We introduce a new heuristic that still exploits the
known regulatory network topology, but not in terms of
network motifs, since in the specific context of transcriptional
target prediction the relationships between transcription
factors and their targets do not exhibit significant network
patterns. The NOIT method gives less importance to
indirect targets, i.e. targets of a transcription factor
regulated indirectly through other transcription factors. The
idea is based on the observation that genes controlled
by a transcription factor, either directly or indirectly through
other transcription factors, are likely to pertain to the same
family of functions, and are therefore unreliable negative
candidates. This is supported by the fact that transcription
factors evolved in the service of specific biological functions
and are usually classified according to their regulatory
function [31] and sequence similarity [32,33]. Moreover,
the activity of downstream targets is usually modulated by
regulatory circuits involving small groups of transcription factors
organized in typical network motifs.

We compare NOIT with other negative selection
approaches known in the literature. For this purpose we adopt
the dataset of Escherichia coli, where almost all transcriptional
regulations are known and a large amount of experimental
data is available for benchmarking (e.g. Faith et al.
[34]). Furthermore, we provide an example of application
in the prediction of BCL6 direct targets in normal germinal
center human B cells by adopting the results of Basso
et al. [13], showing that NOIT predicts 29 correct targets
in the top 50 ranked genes, outperforming other supervised
and unsupervised methods that predict fewer than 10
correct targets. The paper is organized as follows. The
next section (Methods) introduces the NOIT heuristic,
overviews the literature methods that are based on a reliable
negative selection procedure, and describes the
empirical procedures aimed at evaluating the performance
of the negative selection heuristics. The Results and discussion
section reports and discusses the outcomes of the study, and the
last section concludes the paper, outlining directions for
future work.

Methods 

Problem formulation 

In a binary classification problem, given a set of training
examples, $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n) \in X \times \{+1, -1\}$, the
goal is to determine a function $f(x): X \to \{-1, +1\}$ that is
able to predict the label $y \in \{+1, -1\}$ of a new observation
$x \in X$. Machine learning algorithms infer an estimate of
the function $f$ from the available examples. To distinguish
effectively whether a new observation is positive or negative,
the training set should contain a sufficient number of
both positive and negative examples. Such a conventional
condition does not hold in the problem we aim to formalize,
as the training set is composed of only positive examples.
In the context of transcriptional target prediction,
negative examples are in principle not available, as
the nonexistence of a transcriptional activity is hard to
verify experimentally. Liu et al. [20] theoretically showed
that a statistical classifier may take advantage of unlabeled
examples, and that, if the sample size is large enough,
the classifier could converge to a good classifier by
maximizing the number of unlabeled examples classified as
negative while constraining the positive examples to be
correctly classified. The selection of reliable negatives
from the unlabeled set could be crucial for the quality of
a positive-only classifier. With those examples a classifier
could be trained with a traditional two-class set under the
control of a convergence condition. The selection of reliable
negative training examples may, or may not, exploit
the underlying application domain. For example, in the
classification of web documents, reliable negative documents
are those that do not contain any of the most
frequent words extracted from known positive documents [35].

We propose NOIT (NOn Indirect Targets), a negative
selection heuristic that exploits the known regulatory network
topology by giving less importance to indirect targets.
It is formalized as follows. Let $G$ be the set of all genes in
an organism and $TF \subset G$ the set of transcription factors.
Given a transcription factor $tf_i \in TF$, the goal is to infer
a function, $f_{tf_i}(\phi(g)): G \to \{-1, +1\}$, from a set of target
genes, $P_{tf_i} = \{(g_1, +1), (g_2, +1), \ldots, (g_n, +1)\} \subset (G \setminus TF)$,
that are known to be regulated directly by $tf_i$ (i.e. positive
examples). The function should be able to predict the label
$y$ of a new gene $g \in U_{tf_i} = G \setminus (TF \cup P_{tf_i})$ from the unlabeled
set. The transformation function $\phi$ describes each gene
with an $m$-dimensional real-valued feature vector,
$\phi(g): G \to \mathbb{R}^m$, such as expression values measured in $m$
different experimental conditions.

The goal of a negative selection heuristic is to select
from the unlabeled set $U_{tf_i}$ a sufficiently large negative
training set without positive contamination. Our aim is to
propose a method based on the assumption that an unlabeled
gene $g \in U_{tf_i}$ is a bad negative candidate if it is indirectly
controlled by $tf_i$ through other transcription factors.
Such information can be extracted from the known gene
regulatory network or, when such information is not
available, it can be estimated with binding
site promoter analysis [32] and/or unsupervised gene regulatory
prediction [7,9].

We introduce a probability mass function $pmf_{tf_i}(g)$ over the
negative candidates to estimate the probability
that an example $g \in U_{tf_i}$ is a good negative candidate.
We compute $pmf_{tf_i}(g)$ as:

where $k \in [1, |TF|]$ is the minimum number of transcription
factors, $tf_{i+1}, tf_{i+2}, \ldots, tf_{i+k}$, that link $tf_i$ to $g$, i.e. for
every $j = i, \ldots, i + k - 1$, $tf_{j+1}$ is a known target of $tf_j$. The
term $1/|U_{tf_i}|$ serves to scale the probability mass function
to sum to 1. When a path linking $tf_i$ and $g$ through a set of
known transcription factors does not exist, we assume that
$k = |TF|$. In that case the probability is maximum, whereas
it is minimum when at least one $tf_k$ exists such that
$g$ is regulated by $tf_k$ and $tf_k$ is regulated by $tf_i$ (Figure 1).
The hypothesis is that the expression profiles of genes
regulated by $tf_i$ are more correlated with genes selected as
bad negatives than with those selected as good negatives. This
is confirmed by a bootstrapping experiment in which we
repeatedly (e.g. 1000 times) selected two random genes, $g_1$ and
$g_2$, belonging to the targets of a transcription factor, and
two genes, $g_{good}$ and $g_{bad}$, belonging respectively to the good
and bad negative candidates selected by the NOIT procedure.
We computed the correlation between $g_1$-$g_2$,
$g_1$-$g_{good}$, and $g_1$-$g_{bad}$, obtaining the three distributions
shown in Figure 2. The black curve shows the distribution
of correlation between genes within the same targets, the
red curve shows the distribution of correlation between
targets and good negative candidates, and the green curve
shows the distribution of correlation between targets and
bad negative candidates. A two-sample Mann-Whitney
test between the latter two distributions shows a significant
difference ($W = 5940280284$, p-value $< 2.2 \times 10^{-16}$),
suggesting that the NOIT procedure is able to select negatives
that are more distant, in terms of correlation, from targets.

Figure 1 The NOIT (NOn Indirect Targets) negative selection heuristic. Reliable negative examples are sampled from the unlabeled set
distributed according to a heuristic that promotes the non-existence of indirect relationships with the current transcription factor.
Node labels in the diagram: known tf1 target, good tf1 negative candidate, bad tf1 negative candidate.

Figure 2 Distribution of correlation between gene expression profiles. The distribution of gene expression profile correlation computed
between genes regulated by the same transcription factor (black curve), between targets and good negative candidates (red curve), and
between targets and bad negative candidates (green curve). The x-axis reports the correlation.

With a learning scheme similar to SIRENE [16], we
divide the unlabeled set $U_{tf_i}$ into three random folds. The
labels of each fold are predicted with a binary classifier
trained with the known positives and a selection of negative
examples drawn from the other two folds. SIRENE
adopts a method, known as PU learning (Positive Unlabeled
learning), that is strongly affected by the positive
contamination of unlabeled examples, as all unlabeled
examples are considered good negative candidates. We
limit such contamination by selecting the top $NC$ negative
candidates scored by the probability mass function
$pmf_{tf_i}(g)$ introduced above. The number of negative
candidates, $NC$, depends on the number of known
positives: $NC = K \cdot |P_{tf_i}|$. The parameter $K$ may affect the
performance of the classifier. In an experiment performed
on Escherichia coli, we observed on
the independent test set that the best performance is
obtained with $K$ around 10 (Figure 3).
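The selection step described above can be sketched in Python as follows. This is only an illustration under stated assumptions, not the authors' implementation: the regulatory network is assumed to be a plain dictionary mapping each regulator to the set of its known direct targets, the function names `tf_distance` and `select_negatives` are hypothetical, and, since the closed form of the probability mass function is not reproduced here, candidates are ranked directly by the path length $k$, on the assumption that the score grows with $k$ as the text describes.

```python
from collections import deque


def tf_distance(network, tfs, source_tf, gene):
    """Minimum number k of intermediate transcription factors linking source_tf to gene.

    network: dict regulator -> set of known direct targets (assumed input format).
    Returns len(tfs) when no indirect path exists (the convention k = |TF|).
    """
    seen = {source_tf}
    queue = deque()
    # Level 1: transcription factors directly regulated by source_tf.
    for target in network.get(source_tf, ()):
        if target in tfs and target not in seen:
            seen.add(target)
            queue.append((target, 1))
    # Breadth-first search guarantees the first hit uses the fewest intermediates.
    while queue:
        tf, k = queue.popleft()
        if gene in network.get(tf, ()):
            return k
        for target in network.get(tf, ()):
            if target in tfs and target not in seen:
                seen.add(target)
                queue.append((target, k + 1))
    return len(tfs)


def select_negatives(network, tfs, source_tf, positives, genes, K=10):
    """Return the NC = K * |positives| unlabeled genes with the largest k."""
    # Unlabeled set: genes that are neither transcription factors nor known positives.
    unlabeled = [g for g in genes if g not in tfs and g not in positives]
    n_candidates = K * len(positives)
    # Larger k first; ties are broken by gene name to keep the output deterministic.
    ranked = sorted(unlabeled,
                    key=lambda g: (-tf_distance(network, tfs, source_tf, g), g))
    return ranked[:n_candidates]
```

Genes reachable from the transcription factor through a single intermediate regulator receive $k = 1$ and are therefore the last to be picked, which mirrors the "bad negative candidate" case of Figure 1.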

Negative selection methods in literature 

In this section we briefly review the most important
positive-only classification methods that include a reliable
negative selection step in their classification
scheme.
Spy-SVM 

Spy-SVM is a technique proposed in [20] that works as
follows. A percentage of known positives, $\{s_1, s_2, \ldots, s_k\}$,
randomly selected from $P_{tf_i}$ to act as 'spies', is sent to
the unlabeled set $U_{tf_i}$. An SVM classification algorithm is
trained with the positive examples (without the spies) and the
unlabeled set (with the spies) assumed as negatives. The
spies should behave identically to the unknown positive
examples belonging to $U_{tf_i}$, and this allows the behavior
of the unknown positive examples to be inferred reliably. A
threshold $t$ is employed to decide whether an example
in $U_{tf_i}$ is a reliable negative or not. Examples with a
probability of being positive, $P(f(x) = +1)$, lower than $t$ are
the most likely negative examples. The threshold is intuitively
calculated as the minimum of the probabilities of
being positive of the spies, i.e. $t = \min\{P(f(s_1) = +1), P(f(s_2) = +1), \ldots, P(f(s_k) = +1)\}$.
This means that all the spy examples
should be classified as positives.
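The thresholding step of the spy technique can be illustrated with a short sketch. It assumes that the probabilities of being positive have already been estimated by a classifier trained as described above (that training step is outside the sketch), and the function name `spy_reliable_negatives` is a hypothetical one introduced here for illustration.

```python
def spy_reliable_negatives(prob_positive, spies, unlabeled):
    """Apply the spy threshold t = min over spies of P(f(s) = +1).

    prob_positive: dict example id -> estimated probability of being positive
                   (assumed to come from a classifier trained on positives
                   without spies versus unlabeled examples with spies).
    spies:         ids of the known positives hidden in the unlabeled set.
    unlabeled:     ids of the unlabeled examples (may include the spies).
    Returns the threshold and the examples scoring strictly below it.
    """
    t = min(prob_positive[s] for s in spies)
    reliable = [x for x in unlabeled
                if x not in spies and prob_positive[x] < t]
    return t, reliable
```

Taking the minimum over the spies is the most conservative choice; a single badly scored spy lowers the threshold and shrinks the reliable negative set.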
PSoL - Positive Sample only Learning 

PSoL selects strong negative examples using the Euclidean
distance measure [21]. The algorithm starts with a negative
candidate that is the unlabeled example farthest
from $P_{tf_i}$, calculated as the maximum of the minimum distance
from the elements of $P_{tf_i}$. More negative candidates
are selected from the unlabeled set $U_{tf_i}$, satisfying the
constraint that they are different from the known positive
examples and farthest from the previously selected negative
ones.

Figure 3 Effect of the NOIT parameter K on classifier performance. The parameter K determines the number of negative candidates that
will be included in the training set. The figure shows the classifier performance in terms of AUROC for different values of K. Each curve refers to
a different percentage of known positives. The optimal value can be observed around K = 10.

The algorithm assumes that the negative examples
in the unlabeled set are located far from positives and
from the previously selected negative examples. The last
condition ensures that the negative set spans the whole
set of negative examples in the unlabeled set. Given such an initial
negative set, the PSoL method iteratively expands the
negative set by using a two-class SVM trained with
known positives and the current negative selection.
Negative set expansion is repeated until the size of the
remaining unlabeled set goes below a predefined number.
At this last step, the unlabeled data points with the largest
positive decision function values are declared as
positives.
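The initial negative selection can be sketched as a greedy maximin procedure. The version below is a simplification and an assumption of this sketch: each new candidate maximizes its minimum Euclidean distance to the known positives and to the negatives already chosen, whereas the procedure in [21] may apply additional thresholds. The function names are hypothetical, and the positive set is assumed to be non-empty.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def initial_negatives(positives, unlabeled, n_select):
    """Greedy maximin selection of an initial negative set.

    positives, unlabeled: dicts example id -> feature vector (list of floats).
    The first pick is the unlabeled example whose nearest positive is farthest
    away; later picks must also stay far from the negatives already selected.
    """
    reference = list(positives.values())  # points a candidate must be far from
    remaining = dict(unlabeled)
    selected = []
    while remaining and len(selected) < n_select:
        best = max(
            remaining,
            key=lambda g: min(euclidean(remaining[g], r) for r in reference),
        )
        selected.append(best)
        # The chosen negative becomes a reference point for the next picks.
        reference.append(remaining.pop(best))
    return selected
```

The iterative SVM-based expansion that follows this initialization is not covered by the sketch.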

Rocchio-SVM 

Rocchio-SVM is based on a technique adopted in information
retrieval to improve the recall of pertinent documents
through relevance feedback [22]. It identifies the
set of reliable negatives by adopting two prototypes, one
for the positive class, $c_p$, and one for the unlabeled examples,
$c_u$, computed as follows:

where $\alpha$ and $\beta$ adjust the relative impact of positive and
negative training examples. The unlabeled examples that
are more similar to the unlabeled prototype than to the
positive one, i.e. $sim(g, c_p) < sim(g, c_u)$, are selected as
strong negative examples. To compute such a similarity
the Rocchio technique adopts the cosine similarity. With
the known positive examples and the selected negative
examples a conventional SVM classifier is trained and
then used to classify the remaining set of unlabeled
examples.
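As an illustration, the sketch below uses the form of the Rocchio prototypes that is common in the positive-unlabeled learning literature, in which each prototype is $\alpha$ times the mean of the length-normalized vectors of its own class minus $\beta$ times the mean of the length-normalized vectors of the other class. Whether this matches the exact expression used in this paper is an assumption, and the default values of `alpha` and `beta` are illustrative only.

```python
import math


def normalize(v):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def mean_normalized(vectors):
    """Component-wise mean of the unit-length versions of the vectors."""
    unit = [normalize(v) for v in vectors]
    return [sum(col) / len(unit) for col in zip(*unit)]


def cosine(a, b):
    """Cosine similarity; defined as 0 when either vector is null."""
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    if na == 0 or nb == 0:
        return 0.0
    return sum(x * y for x, y in zip(a, b)) / (na * nb)


def rocchio_negatives(positives, unlabeled, alpha=16.0, beta=4.0):
    """Select unlabeled examples closer to the unlabeled prototype c_u than to c_p.

    positives, unlabeled: dicts example id -> feature vector.
    alpha, beta: illustrative defaults, not values taken from the paper.
    """
    mean_p = mean_normalized(list(positives.values()))
    mean_u = mean_normalized(list(unlabeled.values()))
    c_p = [alpha * p - beta * u for p, u in zip(mean_p, mean_u)]
    c_u = [alpha * u - beta * p for p, u in zip(mean_p, mean_u)]
    return [g for g, v in unlabeled.items() if cosine(v, c_p) < cosine(v, c_u)]
```

The selected examples would then serve as the negative class for a conventional two-class SVM, as described in the text.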
Bagging - SVM 

Bagging SVM is an ensemble technique that generally
improves the performance of individual classifiers when
they are unstable or not correlated to each other. Positive-only
learning has a particular structure that leads to
unstable classifiers, due to the positive contamination of
the unlabeled set, which can be advantageously exploited
by a bagging-like procedure [36,37]. The approach collects
the outcome of a large number of classification runs
(e.g. 1000), where each classifier, $F_i$, is trained with the
known positive examples, $P_{tf_i}$, and a random set of $NC$
negative candidates drawn uniformly from $U_{tf_i}$ and
considered as negative examples. The ensemble classifier, $F$,
scores an unlabeled example $g$ by averaging the scores
obtained by that example at each run:

$$F(g) = \frac{1}{|T_g|} \sum_{F_i \in T_g} F_i(g)$$

where $g$ is a member drawn from $U_{tf_i}$, $F_i$ is the $i$-th classifier,
and $T_g$ is the set of partial classifiers that were not
trained with $g$, i.e. the unlabeled example $g$ was not drawn
by the random selection.
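The aggregation scheme can be sketched independently of the base classifier. In the sketch below, `train_and_score` is a hypothetical user-supplied callable standing in for the SVM training and scoring step; its interface, like the other parameter names, is an assumption made for illustration.

```python
import random


def bagging_scores(positives, unlabeled, train_and_score,
                   n_runs=100, n_neg=None, seed=0):
    """Average, for each unlabeled example g, the scores of runs not trained on g.

    train_and_score(positives, negatives, to_score) is a hypothetical callable
    that trains a classifier on positives versus negatives and returns a dict
    example id -> score for the ids in to_score.
    """
    rng = random.Random(seed)
    unlabeled = list(unlabeled)
    n_neg = n_neg or len(positives)
    totals = {g: 0.0 for g in unlabeled}
    counts = {g: 0 for g in unlabeled}
    for _ in range(n_runs):
        # Random negative candidates drawn uniformly from the unlabeled set.
        negatives = set(rng.sample(unlabeled, min(n_neg, len(unlabeled))))
        held_out = [g for g in unlabeled if g not in negatives]
        scores = train_and_score(positives, negatives, held_out)
        for g in held_out:
            totals[g] += scores[g]
            counts[g] += 1
    # An example drawn as a negative in every run has no score and maps to None.
    return {g: (totals[g] / counts[g] if counts[g] else None) for g in unlabeled}
```

Scoring each example only with the classifiers that never saw it as a negative corresponds to restricting the average to the set $T_g$.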

Empirical evaluation methods 

In this section we introduce the datasets, the basic learning
algorithm, and the methods we adopted to empirically
evaluate to what extent a negative selection
heuristic improves the performance of a classifier trained
to infer new transcriptional targets.
Datasets 

To test our approach we adopt the well known dataset of 
Escherichia coli provided by Faith et al. [34], and a data- 
set that was adopted by Basso et al. [13] to predict BCL6 
direct target genes in normal germinal center human B 
cells. 



The dataset of Escherichia coli consists of 445 different 
Affymetrix Antisense2 microarray expression profiles for 
4345 genes. The transcriptional regulatory network of 
Escherichia coli is the most complete annotated network 
consisting of 3293 experimentally confirmed relation- 
ships between 154 transcription factors and 1211 direct 
targets extracted from RegulonDB (version 5) [38]. 

The dataset of Basso et al. is deposited in the Gene
Expression Omnibus database and is accessible through
GEO series accession number GSE12195. It consists of
136 expression profiles of 73 B-cell lymphoma biopsies,
10 purified tonsillar germinal center, 10 naive and memory
B cells, 38 follicular lymphoma biopsies, and 5 lymphoblastoid
cell lines. We normalized the dataset from
CEL files according to the RMA procedure [39] and filtered
out probes with low inter-experiment variability
by means of the varFilter function of the genefilter
Bioconductor package. The final dataset is composed of
136 samples and 9876 genes. Basso et al. identified a
group of 120 new core targets down-regulated by BCL6
with an integrated biochemical-computational-functional
approach (see Supplemental Table S2 of [13]), validated
through ChIP-on-chip.

We show that those 120 new core targets can be predicted
with a supervised learning approach starting from
a positive training set of 171 targets annotated as
down-regulated by BCL6 in a previous work by Ci et al. [40].
For the NOIT negative selection procedure we rely on 47
transcription factors known to be regulated by BCL6 according to
TRANSFAC sequence motif analysis, which considers
those that exhibit a BCL6-bound enrichment in their
promoter regions, as reported in [13]. Their targets were
preliminarily predicted with ARACNE, as reported in
Supplemental Table 5 of reference [13].
Basic Learning algorithm 

We use the Support Vector Machine (SVM), with Platt
scaling [41], to estimate the probability that a target is
regulated by a transcription factor. In particular, we use the
SVM implementation provided by kernlab [42], a package
for kernel-based machine learning methods in R. The
basic element of an SVM algorithm is a kernel function
$K(x_1, x_2)$, where $x_1$ and $x_2$ are feature vectors of two gene
targets. The idea is to construct a separating hyperplane
between two classes, +1 and -1, such that the distance of
the hyperplane to the points closest to it is maximized. The
kernel function implicitly maps the original data into some
high dimensional feature space, in which the optimal
hyperplane can be found. In our experiments we train an
SVM classifier for each transcription factor $tf_i \in TF$
with the known positive targets and the reliable
negative examples chosen by a negative selection
approach. Such a classifier is then used to score the set of
genes $g \in G \setminus TF$ according to their probability of being
regulated by $tf_i$. We used C-support vector classification
(C-SVC), which solves the following problem:

$$\min_{\alpha} \; \frac{1}{2} \alpha^T Q \alpha - e^T \alpha, \qquad Q_{ij} = y_i y_j K(x_i, x_j)$$

subject to $y^T \alpha = 0$ and $0 \le \alpha_i \le C$, $i = 1, \ldots, 2k$, where
$y_i \in \{+1, -1\}$ is the class of vector $x_i$, $e$ is a vector with all
elements equal to one, and $K(x_i, x_j)$ is a kernel function.
We adopt a radial basis kernel function defined as:

$$K(x_i, x_j) = \exp\left(-\gamma \|x_i - x_j\|^2\right)$$

where $C$ and $\gamma$ are parameters that we set empirically
inside the training loop [43].
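For readers who want to check kernel values by hand, a radial basis kernel can be evaluated with a few lines of Python. This is a minimal sketch of the standard Gaussian form with width parameter `gamma`; the helper names are hypothetical, and the experiments themselves rely on the R package kernlab, where the corresponding parameter of the `rbfdot` kernel is called `sigma`.

```python
import math


def rbf_kernel(x_i, x_j, gamma):
    """Radial basis kernel: exp(-gamma * squared Euclidean distance)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x_i, x_j))
    return math.exp(-gamma * sq_dist)


def gram_matrix(vectors, gamma):
    """Kernel matrix K[i][j] = rbf_kernel(vectors[i], vectors[j], gamma)."""
    return [[rbf_kernel(u, v, gamma) for v in vectors] for u in vectors]
```

Larger values of `gamma` make the kernel more local, so that only very similar expression profiles obtain a kernel value close to 1.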
Cross validation and performance measures 

To estimate the unknown performance of a classifier
designed for discrimination, we adopt a workflow consisting
of 5 steps (Figure 4). For each transcription factor $tf_i \in TF$
we partition the original dataset into 10 random folds.
In turn, 9 folds are used for training, while the remaining
fold is used for testing (step 2). Each fold contains a density
of positives that is approximately equal to the density of
positives in the original dataset. The known targets regulated
by $tf_i$ belonging to the current training set are split
into a positive set $P_{tf_i}$, assumed to be the known positive
training set, and an unknown set $Q_{tf_i}$, which becomes part of
the current unlabeled set $U_{tf_i}$ (step 3). The size of $P_{tf_i}$ is
incremented linearly starting from 2, or according to the
fraction $\frac{|P_{tf_i}|}{|P_{tf_i}| + |Q_{tf_i}|}$. To limit the selection bias we re-sample
$P_{tf_i}$ 100 times. The negative training set is extracted from
the unlabeled set, $U_{tf_i}$ (step 4), and adopted, together with
the current known positives, to train an SVM classifier
(step 5).

Figure 4 Evaluation procedure. A negative selection method is evaluated by adopting a completely labeled dataset and a stratified k-fold
cross validation procedure, where the number of known positives is varied linearly starting from 2 or according to its percentage with respect
to the unknown positives (from 10% to 100%). To limit the selection bias of known positives, within each k-fold, the percentage of known
positives is re-sampled 100 times.

Genes belonging to the test set are scored according
to the current classifier and the accuracy of classification
is evaluated at different ranking levels in terms of
precision and recall as follows:

$$PR_n = \frac{TP_n}{n}, \qquad RC_n = \frac{TP_n}{|targets(tf_i)|}$$

where $TP_n$ is the number of true positives appearing
in the top $n$ ranked targets, and $targets(tf_i)$ is the set of
$tf_i$ targets we want to predict in each test set. True
positive rates and false positive rates are instead computed
as:

$$TPR_n = \frac{TP_n}{|Q_{tf_i}|}, \qquad FPR_n = \frac{FP_n}{\#\text{true negatives}}$$

where $FP_n$ is the number of false positives appearing in the
top $n$ ranked targets and #true negatives is the number of true negatives in
the test set. From those measures we also compute aggregate
performance measures, such as AUROC (area under
the ROC curve) and AUPR (area under the precision/recall
curve). For a given selection of known positives,
performance measures are averaged over all folds, all positive
sampling runs, and all transcription factors, obtaining an
overall performance estimate of the classifier.
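The ranking-based measures can be computed directly from an ordered list of genes. The sketch below assumes a ranking without ties, sorted from the highest to the lowest score, and computes the AUROC as the fraction of (positive, negative) pairs ranked in the correct order, which is equivalent to the area under the ROC curve in the absence of ties. The function names are hypothetical.

```python
def precision_recall_at_n(ranked, targets, n):
    """Return (PR_n, RC_n) for the top-n genes of a ranking (best first)."""
    tp = sum(1 for g in ranked[:n] if g in targets)
    return tp / n, tp / len(targets)


def auroc(ranked, targets):
    """Fraction of (positive, negative) pairs where the positive ranks higher."""
    n_pos = sum(1 for g in ranked if g in targets)
    n_neg = len(ranked) - n_pos
    tp = 0
    correct_pairs = 0
    for g in ranked:
        if g in targets:
            tp += 1
        else:
            # Every positive seen so far is ranked above this negative.
            correct_pairs += tp
    return correct_pairs / (n_pos * n_neg)
```

For example, with the ranking `["a", "b", "c", "d"]` and targets `{"a", "c"}`, the precision and recall at $n = 2$ are both 0.5, and three of the four positive/negative pairs are correctly ordered.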

Results and discussion 

Effect of positive contamination 

The contamination of the training set with positive examples
wrongly considered as negatives affects the
performance of a classifier. We define the level of positive
contamination as the fraction $q$ of unknown positives,
with respect to the total number of unknown positives
($Q$), wrongly selected as negatives. Figure 5 shows the
effect, in terms of AUROC (on the left) and AUPR (on the
right), of positive contamination in two extreme conditions:
a training set with full positive contamination
($q = 100\%$) and a training set with no positive
contamination ($q = 0\%$). In the first, all unknown positives
have been selected (wrongly) as negatives, $U = Q + N$.
In the second, the training set is composed only of
true negatives, $U = N$, and represents an ideal classifier
with a perfect negative selection heuristic. In principle, the
actual performance of a negative selection heuristic should
be within the area delimited by the two curves.
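A contaminated negative training set of this kind can be simulated on a fully labeled dataset with a few lines of code. The sketch below is an assumption about how such a simulation might be set up, not the procedure actually used for Figure 5; the function name and the fixed random seed are illustrative.

```python
import random


def contaminated_negatives(true_negatives, unknown_positives, q, seed=0):
    """Build a negative training set in which a fraction q of the unknown
    positives is (wrongly) included among the negatives.

    q = 0.0 gives the ideal set U = N; q = 1.0 gives U = Q + N.
    """
    if not 0.0 <= q <= 1.0:
        raise ValueError("q must lie in [0, 1]")
    rng = random.Random(seed)
    n_wrong = round(q * len(unknown_positives))
    wrong = rng.sample(list(unknown_positives), n_wrong)
    return list(true_negatives) + wrong
```

Training the same classifier on sets built with increasing values of `q` gives intermediate curves between the two extreme conditions described above.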

Both classifiers have been trained in the context of
Escherichia coli with the procedure depicted in Figure 4 at
different levels of known positives (on the x-axis between
0.1 and 1). The main effect is that the performance of both
contaminated and uncontaminated classifiers decreases
as the fraction of known positives decreases, although the
decrease is more rapid for the classifier
trained with full positive contamination. When the fraction
of known positives is minimum (0.1) the difference between
contaminated and uncontaminated classifiers is maximum.

Effect of the negative selection approach 

The performance of a negative selection approach is
affected by the proportion of known positives available in
the training set. With the evaluation procedure depicted
in Figure 4 we evaluated the performance of a negative
selection approach by varying both the relative fraction
and the absolute number of known positives. The latter
is closer to practical use, as users only know the number
of positives they have. Figure 6 reports, for each method,
the average AUROC computed at different fractions of known
positives (on the left) and at different numbers of known
positives (in logarithmic scale on the right). On average the
performance of each method increases with the quantity
of known positives. With the exception of Rocchio, each
method reaches the maximum performance (AUROC
around 0.8) when the training set is completely labeled,
i.e. the percentage of known positives is maximum
(100%). At low levels of known positives the difference
among methods is more significant. Up to a percentage of
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Figure 5 Effect of positive contamination on classifier performance. Positive contamination, i.e. the fraction of positives in the unlabeled 
training set, affects the performance of a classifier. The figure shows two extreme conditions: a classifier trained with unlabeled data totally 
contaminated with positive examples (100%), and a classifier trained without positive contamination (0%). On the left the performance is shown 
in terms of AUROC (area under the roc curve), while on the right it is shown in terms of AUPR (area under the precision/recall curve). 
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Figure 6 Classification performance of different negative selection methods. The performance of different negative selection methods for 
the prediction of transcriptional targets in Escherichia coli. The figures show the classifier performance in terms of AUROC at different 
percentage of known positives (on the left), and at different number of known positives (on the right). 



60% of known positives, or, up to a number of 20 known 
positives, in the training set, the NOIT procedure outper- 
forms significantly all other methods. At low levels of 
known positives the worst performance is registered by 
PU, as in fact does not adopt any negative selection 
approach. Instead, at high levels of known positives the 
worst performance is registered by Rocchio. 

Table 1 summarizes the performance of each method in terms of
average Recall computed at 60% and 80% of precision. The table
reports, at different fractions of known positives, the 95%
confidence intervals of the Recall measures and the statistical
significance (corrected with the Benjamini & Hochberg
procedure) obtained with a pairwise t-test performed between
NOIT and each other method. The adoption of the t-test was
preliminarily justified, as Recall measures follow a normal
distribution (Shapiro test, p-value < 2.2 × 10^-16) and the
one-way ANOVA test showed that Recall measures among methods
are significantly different (ANOVA, p-value < 2.2 × 10^-16). At
low levels of known positives (precisely at 10% and 30%) the
NOIT procedure significantly outperforms all other methods
(with the exception of Bagging, which exhibits only a
marginally significant difference at 10% of known positives
when the precision is set to 60%). The increment in Recall can
be estimated at around 10% with respect to Bagging, which is
the current state of the art in supervised inference of gene
regulatory connections [16,37].
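
The two quantities reported in Table 1 can be illustrated with a short Python sketch. It assumes one common definition of Recall at a fixed precision (the highest recall over all ranking cutoffs whose precision is at least the required level) and implements the standard Benjamini & Hochberg step-up adjustment. The function names are hypothetical, and the exact operating-point definition used for Table 1 is not stated in the text, so this should be read as an assumption.

```python
def recall_at_precision(scores, labels, min_precision):
    """Highest recall over ranking cutoffs with precision >= min_precision.

    scores: classifier scores, higher means more likely positive.
    labels: 1 for a true target, 0 otherwise.
    Returns 0.0 when no cutoff reaches the required precision.
    """
    ranked = sorted(zip(scores, labels), key=lambda t: t[0], reverse=True)
    total_pos = sum(labels)
    if total_pos == 0:
        raise ValueError("no positive examples")
    best, tp = 0.0, 0
    for k, (_, y) in enumerate(ranked, start=1):
        tp += y  # true positives among the top k ranked examples
        if tp / k >= min_precision:
            best = max(best, tp / total_pos)
    return best


def benjamini_hochberg(p_values):
    """Benjamini & Hochberg adjusted p-values, in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted
```

In this sketch the raw p-values of the pairwise tests between NOIT and each other method would be passed together to `benjamini_hochberg`, so that the correction accounts for the number of comparisons.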

Prediction of BCL6 core targets in GC human B cells 

To illustrate an example of application, we predict BCL6 core
targets in GC human B cells adopting data and results provided
by Basso et al. [13]. Figure 7 shows the number of true BCL6
core targets appearing in the top n genes ranked by an SVM
classifier trained with different negative selection
approaches. Each classifier has been trained using the
previously known targets provided by Ci et al. [40], and the
predicted ranked set of genes has been compared with the new
BCL6 core targets published by Basso et al. [13]. For the NOIT
selection procedure we rely on 47 transcription factors,
reported in Supplemental Table S5 of Basso et al. [13], known
to be controlled by BCL6 by means of TRANSFAC sequence motif
analysis. The figure also includes the result obtained with
ARACNE [7], an unsupervised method adopted by Basso et al. [13]
that ranks genes according to their mutual information with
BCL6. It is noticeable that supervised reverse engineering
methods perform better than unsupervised ones, a result already
confirmed in the literature [16]. Among supervised methods
there is a remarkable difference in the top 50 ranked genes,
where NOIT predicts 29 correct targets (about 60% precision),
outperforming the other methods, which predict fewer than 10
correct targets. Beyond the first 200 ranked genes the Bagging
method exhibits the best performance, reaching 66 correct
targets in the first 1000 ranked genes, whereas NOIT predicts
only 51 and the others fewer than 45.
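
The comparison summarized in Figure 7 amounts to counting validated targets among the top n ranked genes. A minimal Python sketch follows; the function names and the toy gene identifiers in the usage note are hypothetical and serve only to show the computation.

```python
def hits_in_top_n(ranked_genes, validated_targets, n):
    """Count validated targets among the first n ranked genes."""
    validated = set(validated_targets)
    return sum(1 for gene in ranked_genes[:n] if gene in validated)


def precision_at_n(ranked_genes, validated_targets, n):
    """Fraction of the first n ranked genes that are validated targets."""
    if n <= 0:
        raise ValueError("n must be positive")
    return hits_in_top_n(ranked_genes, validated_targets, n) / n
```

For instance, with `ranked = ["g1", "g2", "g3", "g4", "g5"]` and validated targets `{"g1", "g4", "g9"}`, the top 5 contain 2 hits. Applied to the figures reported above, 29 hits in the top 50 correspond to 29/50 = 0.58.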

We remark that with this experiment we predicted an
interesting number of BCL6 targets without the integrated
approach, consisting of wide spectrum genomics experiments,
adopted by Basso et al. [13] (Figure S6 of [13]). Furthermore,
among supervised techniques, the NOIT procedure can take
advantage of supplemental transcriptional information, which
is available in many contexts.

Table 1 Recall of negative selection heuristics at 80% and 60% of precision.

Method   % of Known Recall (Pr = 80%)  p-value       Recall (Pr = 60%)  p-value
         Positives                     (corrected)                      (corrected)

NOIT     10         0.179 (± 0.052)    -             0.203 (± 0.053)    -
PSOL     10         0.043 (± 0.020)    2.0 × 10^-5   0.070 (± 0.031)    1.2 × 10^-4
BAGGING  10         0.066 (± 0.027)    7.1 × 10^-4   0.132 (± 0.051)    9.5 × 10^-2
ROCCHIO  10         0.036 (± 0.023)    1.1 × 10^-5   0.053 (± 0.032)    2.0 × 10^-5
SPY      10         0.022 (± 0.011)    7.3 × 10^-7   0.038 (± 0.017)    6.4 × 10^-7
PU       10         0.013 (± 0.004)    2.0 × 10^-7   0.038 (± 0.017)    6.4 × 10^-7

NOIT     30         0.252 (± 0.060)    -             0.384 (± 0.059)    -
PSOL     30         0.140 (± 0.039)    5.9 × 10^-3   0.232 (± 0.052)    5.7 × 10^-4
BAGGING  30         0.158 (± 0.047)    3.5 × 10^-2   0.272 (± 0.067)    2.9 × 10^-2
ROCCHIO  30         0.006 (± 0.002)    1.8 × 10^-10  0.010 (± 0.006)    1.2 × 10^-16
SPY      30         0.123 (± 0.036)    1.1 × 10^-3   0.200 (± 0.049)    2.0 × 10^-5
PU       30         0.079 (± 0.024)    3.3 × 10^-6   0.160 (± 0.036)    3.5 × 10^-8

NOIT     50         0.294 (± 0.062)    -             0.446 (± 0.065)    -
PSOL     50         0.240 (± 0.056)    3.6 × 10^-1   0.366 (± 0.064)    1.3 × 10^-1
BAGGING  50         0.245 (± 0.053)    3.9 × 10^-1   0.374 (± 0.069)    1.8 × 10^-1
ROCCHIO  50         0.010 (± 0.006)    1.1 × 10^-?   0.017 (± 0.011)    1.3 × 10^-17
SPY      50         0.228 (± 0.062)    2.6 × 10^-1   0.336 (± 0.067)    4.0 × 10^-2
PU       50         0.230 (± 0.053)    2.5 × 10^-1   0.320 (± 0.056)    9.8 × 10^-3

NOIT     70         0.278 (± 0.064)    -             0.486 (± 0.066)    -
PSOL     70         0.249 (± 0.063)    7.4 × 10^-1   0.397 (± 0.071)    1.1 × 10^-1
BAGGING  70         0.304 (± 0.059)    7.4 × 10^-1   0.433 (± 0.071)    3.7 × 10^-1
ROCCHIO  70         0.011 (± 0.006)    1.6 × 10^-10  0.019 (± 0.012)    5.9 × 10^-19
SPY      70         0.233 (± 0.064)    5.0 × 10^-1   0.359 (± 0.074)    2.6 × 10^-2
PU       70         0.305 (± 0.066)    7.4 × 10^-1   0.435 (± 0.068)    3.7 × 10^-1

NOIT     90         0.328 (± 0.066)    -             0.511 (± 0.065)    -
PSOL     90         0.239 (± 0.070)    1.4 × 10^-1   0.391 (± 0.081)    4.1 × 10^-2
BAGGING  90         0.352 (± 0.065)    7.5 × 10^-1   0.494 (± 0.062)    8.6 × 10^-1
ROCCHIO  90         0.011 (± 0.005)    3.7 × 10^-12  0.022 (± 0.013)    4.9 × 10^-20
SPY      90         0.296 (± 0.068)    7.4 × 10^-1   0.436 (± 0.071)    1.8 × 10^-1
PU       90         0.337 (± 0.067)    1             0.509 (± 0.064)    1

The table shows, at different percentages of known positives, the average Recall of each negative selection method at 80% and 60% of precision (the 95% confidence
interval is shown in parentheses). The p-value columns (corrected with Benjamini & Hochberg) report the outcome of a t-test performed to check whether the Recall of
NOIT is greater than the Recall of another negative selection method. A p-value below 0.05 is considered statistically significant.

Conclusions 

The availability of only positive examples negatively affects
the performance of supervised classifiers. This is particularly
evident in the context of learning transcriptional
relationships. We showed that the selection of reliable
negative examples, a practice adopted in text mining
approaches, could improve the performance of such classifiers,
opening new perspectives in predicting new transcriptional
targets. We introduced a new negative selection heuristic,
NOIT, that promotes, as negative candidates of a transcription
factor, genes that are not regulated indirectly through other
transcription factors. The method has been tested against other
negative selection procedures, showing that it is able to
improve the average performance by almost 10%, in terms of
AUROC and AUPR, especially when the number of known positives
is low. We provided an example of application in the prediction
of BCL6 direct core targets in normal germinal center human
B cells by adopting the results of Basso et al. [13]. We showed
that the top 50 genes ranked with the NOIT method include 29 of
the 120 targets experimentally demonstrated by Basso et al.
[13]. This is promising, as such targets have been predicted
without intersecting the results of ChIP-on-chip assays, ARACNe
outcomes, and transcriptional repression in GC experiments.
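
The idea behind the heuristic, as summarized in this paragraph, can be sketched in a few lines of Python. This is only an illustration of the stated principle (exclude from the negative candidates of a transcription factor the genes it may regulate indirectly through another transcription factor), under the assumption that the known network is available as a mapping from regulators to their direct targets. The function name and data layout are hypothetical and do not reproduce the authors' tool.

```python
def noit_negative_candidates(tf, network, genes):
    """Select negative candidates for `tf` by excluding indirect targets.

    network: {regulator: set of known direct targets} (assumed layout)
    genes:   all genes under consideration
    """
    direct = set(network.get(tf, ()))
    # Genes reachable in two steps: tf -> intermediate regulator -> gene.
    indirect = set()
    for intermediate in direct:
        if intermediate != tf:
            indirect |= set(network.get(intermediate, ()))
    # Known targets, indirect targets, and the factor itself are excluded.
    excluded = direct | indirect | {tf}
    return sorted(g for g in genes if g not in excluded)
```

On a toy network where `tfA` regulates `tfB` and `g1`, and `tfB` regulates `g2` and `g3`, the genes `g2` and `g3` are treated as possible indirect targets of `tfA` and are therefore not proposed as negatives.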

Threats to external validity, concerning the possibility of
generalizing our findings, affect the study, as we evaluated
the heuristics on a limited number of organisms. The study can
be replicated, as the tools are available upon request from the
authors and the experimental datasets are publicly available.

Figure 7 Top n BCL6 core targets in GC human B cells predicted with different negative selection methods. The number of true BCL6
targets (x-axis: top n scored targets, from 0 to 1000) as predicted by different negative selection procedures and validated with those published by Basso et al. [13].

Cerulo et al. BMC Bioinformatics 2013, 14(Suppl 1):S3
http://www.biomedcentral.com/1471-2105/14/S1/S3